Hadoop Performance Tuning - A Pragmatic & Iterative Approach
Abstract
Hadoop is a Java-based distributed computing framework designed to support applications implemented via the MapReduce programming model. In general, workload-dependent Hadoop performance optimization efforts have to focus on three major categories: the system hardware (HW), the system software (SW), and the configuration and tuning of the Hadoop infrastructure components. From a HW perspective, it is paramount to balance the HW components with regard to performance, scalability, and cost. It should be noted that Hadoop is classified as a highly scalable, but not necessarily a high-performance, cluster solution. From a SW perspective, the choice of the OS, the JVM, the specific Hadoop version, and the other SW components necessary to run the Hadoop setup has a profound impact on the performance and stability of the environment. The design, setup, configuration, and tuning phase of any Hadoop project is essential to benefit fully from the distributed Hadoop HW and SW solution stack.
Similar resources
Towards an Ontology-Based Semantic Approach to Tuning Parameters to Improve Hadoop Application Performance
Hadoop MapReduce assists companies and researchers to deal with processing large volumes of data. Hadoop has a lot of configuration parameters that must be tuned in order to obtain a better application performance. However, the best tuning of the parameters is not easily obtained by inexperienced users. Therefore, it is necessary to create environments that promote and motivate information shar...
Master’s Thesis: A Tuning Approach Based on Evolutionary Algorithm and Data Sampling for Boosting Performance of MapReduce Programs
The Apache Hadoop data processing software is immersed in a complex environment composed of huge machine clusters, large data sets, and several processing jobs. Managing a Hadoop environment is time consuming, toilsome and requires expert users. Thus, lack of knowledge may entail misconfigurations degrading the cluster performance. Indeed, users spend a lot of time tuning the system instead of ...
Optimization Framework for Map Reduce Clusters on Hadoop’s Configuration
Hadoop represents a Java-based distributed computing framework that is designed to support applications that are implemented via the MapReduce programming model. Hadoop performance however is significantly affected by the settings of the Hadoop configuration parameters. Unfortunately, manually tuning these parameters is very time-consuming. Existing system uses Random forest approa...
HiTune: Dataflow-Based Performance Analysis for Big Data Cloud
Although Big Data Cloud (e.g., MapReduce, Hadoop and Dryad) makes it easy to develop and run highly scalable applications, efficient provisioning and fine-tuning of these massively distributed systems remain a major challenge. In this paper, we describe a general approach to help address this challenge, based on distributed instrumentations and dataflow-driven performance analysis. Based on this...
Some Workload Scheduling Alternatives in a High Performance Computing Environment
Clusters of commodity microprocessors have overtaken custom-designed systems as the high performance computing (HPC) platform of choice. The design and optimization of workload scheduling systems for clusters has been an active research area. This paper surveys some examples of workload scheduling methods used in large-scale applications such as Google, Yahoo, and Amazon that use a MapReduce pa...
Publication date: 2013